We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e., non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details, such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative, the Group Propagation Block (GP Block), to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation, where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs. For example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters. Code and pre-trained models are available at https://github.com/ChenhongyiYang/GPViT .
translated by 谷歌翻译
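The three stages of the GP Block described in the GPViT abstract (grouping, propagation, ungrouping) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the attention here has no learned projections, plain self-attention among the group tokens stands in for the propagation step (the abstract does not say which operator is used), and the names `attend` and `gp_block` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections (simplification)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def add(xs, ys):
    """Element-wise residual connection between two token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def gp_block(features, group_tokens):
    """features: N tokens, group_tokens: M tokens, with M much smaller than N."""
    # 1) Grouping: each group token gathers information from all N features (M x N scores).
    grouped = attend(group_tokens, features, features)
    # 2) Propagation: the M grouped features exchange information (M x M scores).
    #    Assumption: self-attention is used here as a stand-in operator.
    propagated = add(grouped, attend(grouped, grouped, grouped))
    # 3) Ungrouping: each feature reads back from the M updated groups (N x M scores).
    return add(features, attend(features, propagated, propagated))
```

In this sketch no step compares image features with each other, so the number of attention scores grows as N x M rather than N x N, which is the kind of saving the abstract attributes to grouping.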
We revisit a simple Learning-from-Scratch baseline for visuo-motor control that uses data augmentation and a shallow ConvNet. We find that this baseline has competitive performance with recent methods that leverage frozen visual representations trained on large-scale vision datasets.
Privacy in AI has drawn attention from researchers and the general public in recent years. As one way to implement privacy-preserving AI, differentially private learning is a framework that enables AI models to use differential privacy (DP). To achieve DP in the learning process, existing algorithms typically limit the magnitude of gradients with a constant clipping threshold, which requires careful tuning because of its significant impact on model performance. As a solution to this issue, the recent works NSGD and Auto-S propose to use normalization instead of clipping to avoid hyperparameter tuning. However, normalization-based approaches like NSGD and Auto-S rely on a monotonic weight function, which imposes excessive weight on samples with small gradients and introduces extra deviation to the update. In this paper, we propose a Differentially Private Per-Sample Adaptive Clipping (DP-PSAC) algorithm based on a non-monotonic adaptive weight function, which guarantees privacy without the hyperparameter tuning typically required by constant clipping, while significantly reducing the deviation between the update and the true batch-averaged gradient. We provide a rigorous theoretical convergence analysis and show that, with a convergence rate of the same order, the proposed algorithm achieves a lower non-vanishing bound, which is maintained over training iterations, compared with NSGD/Auto-S. In addition, through extensive experimental evaluation, we show that DP-PSAC outperforms or matches the state-of-the-art methods on multiple mainstream vision and language tasks.
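To make the contrast between clipping and normalization concrete, the sketch below compares per-sample weight functions in plain Python. The clipping and normalization weights follow the standard forms; the non-monotonic weight is only one possible form, included as an assumption for illustration since the abstract does not state the formula. The function names and the parameters `c` (norm bound), `r` (stabilizer) and `sigma` (noise multiplier) are hypothetical.

```python
import math
import random

def l2_norm(g):
    return math.sqrt(sum(x * x for x in g))

def clip_weight(norm, c):
    """Constant clipping: leave small gradients alone, rescale large ones to norm c."""
    return min(1.0, c / norm) if norm > 0 else 1.0

def normalize_weight(norm, c, r):
    """Normalization: monotonically decreasing in the norm, so tiny gradients get weight near c / r."""
    return c / (norm + r)

def nonmonotonic_weight(norm, c, r):
    """Assumed illustrative form: stays near c for tiny norms instead of c / r."""
    return c / (norm + r / (norm + r))

def private_update(grads, weight_fn, c, sigma, rng):
    """Sum weighted per-sample gradients, add Gaussian noise scaled to c, then average."""
    dim = len(grads[0])
    total = [0.0] * dim
    for g in grads:
        w = weight_fn(l2_norm(g))
        for i in range(dim):
            total[i] += w * g[i]
    return [(t + rng.gauss(0.0, sigma * c)) / len(grads) for t in total]

if __name__ == "__main__":
    for norm in (1e-3, 0.1, 1.0, 10.0):
        print(norm,
              clip_weight(norm, 1.0),
              normalize_weight(norm, 1.0, 0.01),
              nonmonotonic_weight(norm, 1.0, 0.01))
```

All three weights keep each weighted gradient's norm at or below `c`, which is what bounds the sensitivity; they differ in how much they amplify samples whose gradients are already small.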
Product ranking is the core problem for revenue-maximizing online retailers. To design proper product ranking algorithms, various consumer choice models have been proposed to characterize consumers' behaviors when they are provided with a list of products. However, existing works assume that each consumer purchases at most one product or will keep viewing the product list after purchasing a product, which does not agree with common practice in real scenarios. In this paper, we assume that each consumer can purchase multiple products at will. To model consumers' willingness to view and purchase, we set a random attention span and a random purchase budget, which determine the maximum number of products that a consumer views and purchases, respectively. Under this setting, we first design an optimal ranking policy for the case in which the online retailer can precisely model consumers' behaviors. Based on this policy, we further develop the Multiple-Purchase-with-Budget UCB (MPB-UCB) algorithms with $\tilde{O}(\sqrt{T})$ regret, which estimate consumers' behaviors and maximize revenue simultaneously in online settings. Experiments on both synthetic and semi-synthetic datasets demonstrate the effectiveness of the proposed algorithms.
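In this work, we tackle the challenging task of jointly tracking hand and object poses and reconstructing their shapes from depth point cloud sequences in the wild. We propose HandTrackNet to estimate inter-frame hand motion. HandTrackNet introduces a novel hand pose canonicalization module that eases the tracking task, yielding accurate and robust hand joint tracking. Our pipeline then reconstructs the full hand by converting the predicted hand joints into the template-based parametric hand model MANO. For object tracking, we devise a simple yet effective module that estimates the object SDF from the first frame and performs optimization-based tracking. Finally, a joint optimization step performs joint hand and object reasoning, which alleviates occlusion-induced ambiguity and further refines the hand pose. During training, the whole pipeline sees only purely synthetic data, which are synthesized with sufficient variation and through depth simulation to ease generalization. The whole pipeline takes the generalization gap into account and can therefore be transferred directly to real in-the-wild data. We evaluate our method on two real hand-object interaction datasets, HO3D and DexYCB, without any fine-tuning. Our experiments show that the proposed method significantly outperforms previous depth-based hand and object pose estimation and tracking methods, running at a frame rate of 9 FPS.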
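Integrating high-level context information with low-level details is of central importance in semantic segmentation. To this end, most existing segmentation models apply bilinear up-sampling and convolutions to feature maps of different scales and then align them at the same resolution. However, bilinear up-sampling blurs the precise information learned in these feature maps, and convolutions incur extra computation costs. To address these issues, we propose the Implicit Feature Alignment function (IFA). Our method is inspired by the rapidly expanding topic of implicit neural representations, in which coordinate-based neural networks are used to represent fields of signals. In IFA, feature vectors are viewed as representing a 2D field of information. Given a query coordinate, nearby feature vectors with their relative coordinates are taken from the multi-level feature maps and then fed into an MLP to generate the corresponding output. As such, IFA implicitly aligns the feature maps at different levels and is capable of producing segmentation maps at arbitrary resolutions. We demonstrate the efficacy of IFA on multiple datasets, including Cityscapes, PASCAL Context, and ADE20K. Our method can be combined with improvements to various architectures, and it achieves a state-of-the-art computation-accuracy trade-off on common benchmarks. Code will be made available at https://github.com/hzhupku/ifa.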
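The query step described in the IFA abstract (look up nearby feature vectors at each level, attach relative coordinates, feed the result to an MLP) can be sketched as follows. This is a simplified sketch, not the authors' implementation: it takes only the single nearest feature vector per level, represents feature maps as nested Python lists, and uses hypothetical names (`nearest_feature`, `mlp`, `ifa_query`) and hand-supplied MLP weights.

```python
def nearest_feature(fmap, y, x):
    """fmap is an H x W grid of feature vectors; (y, x) is normalised to [0, 1)."""
    h, w = len(fmap), len(fmap[0])
    i = min(int(y * h), h - 1)
    j = min(int(x * w), w - 1)
    # Centre of the selected cell, in the same normalised coordinates.
    cy, cx = (i + 0.5) / h, (j + 0.5) / w
    # Append the query's offset from the cell centre to the feature vector.
    return fmap[i][j] + [y - cy, x - cx]

def mlp(vec, layers):
    """layers is a list of (weights, bias); ReLU between layers, none after the last."""
    for idx, (weights, bias) in enumerate(layers):
        vec = [sum(w * v for w, v in zip(row, vec)) + b
               for row, b in zip(weights, bias)]
        if idx < len(layers) - 1:
            vec = [max(0.0, v) for v in vec]
    return vec

def ifa_query(fmaps, layers, y, x):
    """Concatenate per-level (feature, relative coordinate) inputs and decode them."""
    inputs = []
    for fmap in fmaps:
        inputs += nearest_feature(fmap, y, x)
    return mlp(inputs, layers)
```

Because the query coordinate is continuous, the same function can be evaluated on a grid of any density, which is how a model of this kind can produce outputs at arbitrary resolution without first resampling the feature maps to a shared size.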
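Videos typically record streaming, continuous visual data as discrete consecutive frames. Because storing high-fidelity video is expensive, most videos are stored at a relatively low resolution and frame rate. Recent work on Space-Time Video Super-Resolution (STVSR) incorporates temporal interpolation and spatial super-resolution in a unified framework. However, most of these methods support only a fixed up-sampling scale, which limits their flexibility and applications. In this work, instead of following discrete representations, we propose Video Implicit Neural Representation (VideoINR) and show its application to STVSR. The learned implicit neural representation can be decoded into videos of arbitrary spatial resolution and frame rate. We show that VideoINR achieves competitive performance with state-of-the-art STVSR methods on common up-sampling scales and significantly outperforms prior works on continuous and out-of-training-distribution scales. Our project page is at http://zeyuan-chen.com/videoinr/.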
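Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring the grouping mechanism back into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid-structured representation and learns to group image regions into progressively larger, arbitrarily shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group semantic regions together and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on PASCAL VOC 2012 and 22.4% mIoU on the PASCAL Context dataset, and performs competitively with state-of-the-art transfer-learning methods that require greater levels of supervision. We open-source our code at https://github.com/nvlabs/groupvit.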
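The massive amount of electronic health records (EHRs) has created enormous potential for improving healthcare. Clinical codes (structured data) and clinical narratives (unstructured data) are two important textual modalities in EHRs. Clinical codes convey diagnostic and treatment information during a hospital stay, while clinical notes carry clinical providers' narratives of patient encounters. They do not exist in isolation and can complement each other in most real-life clinical scenarios. However, most existing EHR-oriented studies either focus on a single modality or integrate data from different modalities in a straightforward manner, which ignores the intrinsic interactions between them. To address these issues, we propose a medical multimodal pre-trained language model, named MedM-PLM, to learn enhanced EHR representations over structured and unstructured data. In MedM-PLM, two Transformer-based neural network components are first adopted to learn representative features from each modality. A cross-modal module is then introduced to model their interactions. We pre-train MedM-PLM on the MIMIC-III dataset and verify the effectiveness of the model on three downstream clinical tasks: medication recommendation, 30-day readmission prediction, and ICD coding. Extensive experiments demonstrate the power of MedM-PLM compared with state-of-the-art methods. Further analyses and visualizations show the robustness of our model, which could potentially provide more comprehensive interpretations for clinical decision-making.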
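We propose a novel framework for multi-person 3D motion trajectory prediction. Our key observation is that a person's actions and behaviors may depend heavily on the other people around them. Thus, instead of predicting each human pose trajectory in isolation, we introduce a Multi-Range Transformer model, which contains a local-range encoder for individual motion and a global-range encoder for social interactions. A Transformer decoder then makes a prediction for each person by taking the corresponding pose as a query that attends to both local-range and global-range encoder features. Our model not only outperforms state-of-the-art methods on long-term 3D motion prediction but also generates diverse social interactions. More interestingly, our model can even predict the motion of 15 people simultaneously by automatically dividing them into different interaction groups. The project page with code is available at https://jiahunwang.github.io/mrt/.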